This files is a summary of the work made by GMM5 students on data coming from ENVT study.
The data are processed using the following packages:
library(mixOmics)
library(metagenomeSeq)
library(reshape2)
Data are given in two files included in the directory data :
abondances.csv that contains microbiote data:df_abundances <- read.delim("../data/abondances.csv", sep = ",",
stringsAsFactors = FALSE)
summary(df_abundances[ ,1:15])
## X.blast_taxonomy blast_subject blast_perc_identity
## Length:406 Length:406 Min. : 94.18
## Class :character Class :character 1st Qu.: 99.39
## Mode :character Mode :character Median : 99.80
## Mean : 99.32
## 3rd Qu.:100.00
## Max. :100.00
## blast_perc_query_coverage blast_evalue blast_aln_length
## Min. : 98.13 Min. :0 Min. :430.0
## 1st Qu.:100.00 1st Qu.:0 1st Qu.:466.2
## Median :100.00 Median :0 Median :480.0
## Mean : 99.96 Mean :0 Mean :477.4
## 3rd Qu.:100.00 3rd Qu.:0 3rd Qu.:490.0
## Max. :100.00 Max. :0 Max. :512.0
## seed_id observation_name observation_sum
## Length:406 Length:406 Min. : 224.0
## Class :character Class :character 1st Qu.: 588.5
## Mode :character Mode :character Median : 979.5
## Mean : 9753.7
## 3rd Qu.: 2738.0
## Max. :880523.0
## NG.10214_EN10A_lib136338_4869_1 NG.10214_EN10B_lib136339_4869_1
## Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.0 Median : 0.0
## Mean : 108.4 Mean : 108.4
## 3rd Qu.: 12.0 3rd Qu.: 12.0
## Max. :17044.0 Max. :17663.0
## NG.10214_EN11A_lib136340_4869_1 NG.10214_EN11B_lib136341_4869_1
## Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 1.0 Median : 1.0
## Mean : 108.4 Mean : 108.4
## 3rd Qu.: 6.0 3rd Qu.: 6.0
## Max. :24223.0 Max. :23739.0
## NG.10214_EN15A_lib136342_4869_1 NG.10214_EN15B_lib136343_4869_1
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 3.00 Median : 4.00
## Mean : 108.37 Mean : 108.37
## 3rd Qu.: 15.75 3rd Qu.: 15.75
## Max. :28421.00 Max. :27946.00
The first 9 columns contain information about the different bacteria identified within samples. The following columns contain the two technical replicates (identified by “A” and “B”) of all the farms involved in the study (the farm is identified by a number preceeding the letter “A”/“B”).
Note that some “blast taxonomy” are duplicated:
unique(names(which(table(df_abundances[ ,1]) > 1)))
## [1] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Corynebacteriaceae;Corynebacterium 1;Corynebacterium sp."
## [2] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Corynebacteriaceae;Corynebacterium 1;unknown species"
## [3] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Corynebacteriaceae;Corynebacterium;unknown species"
## [4] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Dietziaceae;Dietzia;Dietzia sp."
## [5] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Nocardiaceae;Rhodococcus;Rhodococcus sp."
## [6] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Brevibacteriaceae;Brevibacterium;Brevibacterium antiquum"
## [7] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Brevibacteriaceae;Brevibacterium;unknown species"
## [8] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Intrasporangiaceae;Janibacter;unknown species"
## [9] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Microbacteriaceae;Leucobacter;unknown species"
## [10] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Micrococcaceae;Arthrobacter;Arthrobacter sp."
## [11] "Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces sp."
## [12] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;unknown species"
## [13] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Petrimonas;unknown species"
## [14] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Porphyromonas;unknown species"
## [15] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Proteiniphilum;unknown species"
## [16] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Alloprevotella;unknown species"
## [17] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella 1;unknown species"
## [18] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella 2;unknown species"
## [19] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella 9;unknown species"
## [20] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Rikenellaceae;Rikenellaceae RC9 gut group;unknown species"
## [21] "Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Bergeyella;unknown species"
## [22] "Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Flavobacterium;unknown species"
## [23] "Bacteria;Bacteroidetes;Sphingobacteriia;Sphingobacteriales;Sphingobacteriaceae;Olivibacter;unknown species"
## [24] "Bacteria;Bacteroidetes;Sphingobacteriia;Sphingobacteriales;Sphingobacteriaceae;Pedobacter;unknown species"
## [25] "Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Caryophanon;Caryophanon latum"
## [26] "Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Jeotgalicoccus;unknown species"
## [27] "Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus sp."
## [28] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Aerococcaceae;Facklamia;unknown species"
## [29] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Aerococcaceae;unknown genus;unknown species"
## [30] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Carnobacteriaceae;Trichococcus;unknown species"
## [31] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Carnobacteriaceae;unknown genus;unknown species"
## [32] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;unknown species"
## [33] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pluranimalium"
## [34] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;unknown species"
## [35] "Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae 1;Clostridium sensu stricto 1;unknown species"
## [36] "Bacteria;Firmicutes;Clostridia;Clostridiales;Family XI;Peptoniphilus;unknown species"
## [37] "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Blautia;unknown species"
## [38] "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Lachnoclostridium;unknown species"
## [39] "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Lachnospiraceae NK3A20 group;unknown species"
## [40] "Bacteria;Firmicutes;Clostridia;Clostridiales;Peptostreptococcaceae;Intestinibacter;unknown species"
## [41] "Bacteria;Firmicutes;Clostridia;Clostridiales;Peptostreptococcaceae;Peptoclostridium;unknown species"
## [42] "Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Faecalibacterium;unknown species"
## [43] "Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Ruminococcaceae UCG-005;unknown species"
## [44] "Bacteria;Firmicutes;Erysipelotrichia;Erysipelotrichales;Erysipelotrichaceae;Erysipelothrix;unknown species"
## [45] "Bacteria;Fusobacteria;Fusobacteriia;Fusobacteriales;Leptotrichiaceae;unknown genus;unknown species"
## [46] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Hyphomicrobiaceae;Devosia;unknown species"
## [47] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Methylobacteriaceae;Methylobacterium;Methylobacterium sp."
## [48] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Phyllobacteriaceae;Aquamicrobium;unknown species"
## [49] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Rhizobium;Agrobacterium tumefaciens"
## [50] "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Alcaligenaceae;Alcaligenes;Alcaligenes faecalis"
## [51] "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Comamonas;unknown species"
## [52] "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Oxalobacteraceae;Massilia;unknown species"
## [53] "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia-Shigella;Escherichia coli"
## [54] "Bacteria;Proteobacteria;Gammaproteobacteria;Pasteurellales;Pasteurellaceae;Mannheimia;Mannheimia sp."
## [55] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Moraxella;Moraxella oblonga"
## [56] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Moraxella;unknown species"
## [57] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas sp."
## [58] "Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Luteimonas;unknown species"
## [59] "Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Stenotrophomonas;Stenotrophomonas sp."
## [60] "Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Stenotrophomonas;unknown species"
## [61] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovirhinis"
as well as some of the bast id themselves
unique(names(which(table(df_abundances[ ,2]) > 1)))
## [1] "AB680687.1.1434" "AB681778.1.1454" "AF224286.1.1534"
## [4] "AJ491302.1.1509" "AM183097.1.1441" "AY243344.1.1501"
## [7] "AY725811.1.1474" "FJ672272.1.1408" "FJ674571.1.1375"
## [10] "FJ674662.1.1457" "GQ358834.1.1513" "GQ358846.1.1455"
## [13] "HM325988.1.1341" "HM327870.1.1341" "JX986968.1.1511"
## [16] "KC894531.1.1498"
We need to know what to do with those…?
Also, species names are extracted (last not unknown name) by
species <- sapply(df_abundances[ ,1], function(aname)
unlist(strsplit(aname, ";")))
species <- sapply(species, function(avect) {
find_unknown <- grep("unknown", avect)
if (length(find_unknown) > 0) {
return(avect[-find_unknown])
} else return(avect)
})
species <- unlist(sapply(species, function(avect) avect[length(avect)]))
species <- unname(species)
#l'esp?ce c'est le dernier mot de chaque ligne avec les bact?ries
First, the two technical replicates are merged (simple sums as the counts have already been normalized to identical library sizes)
abundances <- df_abundances[ ,grep("A_", colnames(df_abundances))] +
df_abundances[ ,grep("B_", colnames(df_abundances))]
dim(abundances)
## [1] 406 45
which leads to 45 samples (columns) in which 406 bacteria have been observed (rows).
Condition (“EN” or “LBA”) is extracted from column names:
condition <- rep("LBA", ncol(abundances))
condition[grep("EN", colnames(abundances))] <- "EN"
table(condition)
## condition
## EN LBA
## 22 23
Also, the farm identifier is extracted from column names:
id_abundances <- as.character(colnames(abundances))
id_abundances <- sapply(id_abundances, function(ac)
substr(ac, nchar(ac) - 19, nchar(ac) - 18))
id_abundances <- gsub("A", "0", id_abundances)
id_abundances <- gsub("N", "0", id_abundances)
table(id_abundances)
## id_abundances
## 01 03 04 06 07 09 10 11 15 16 17 18 19 20 21 23 24 25 26 28 29 30 31
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
All but one farm (idenfier: 29) have been sampled twice, once for each condition.
list_doublons<-unique(names(which(table(df_abundances[ ,1]) > 1)))
fusion_doublons<-as.data.frame(t(sapply(list_doublons,FUN=function(x){as.data.frame(t(colSums(abundances[grep(x,df_abundances[,1]),])))})))
list_pas_doublons <- unique(names(which(table(df_abundances[ ,1]) == 1)))
data <- rbind(abundances[match(list_pas_doublons,df_abundances[,1]),],fusion_doublons)
data contient maintenant aucun doublon : les doublons ont été additionnés entre eux.
rownames(data)<-c(list_pas_doublons,list_doublons)
colnames(data) <- id_abundances
list_bacteria<-rownames(data) # on stocke le nom des bactéries
data <- data[,-grep("29",id_abundances)]
colnames(data) <- paste0("sample",id_abundances[-grep("29",id_abundances)],"_",condition[-grep("29",id_abundances)])
dim(data)
## [1] 273 44
The effect of different normalization is first explored by analyzing the distributions of the counts in the first sample before and after normalization. Distribution before normalization is provided as:
abundances <- as.data.frame(sapply(data,as.numeric))
df<-abundances
names(df) <- paste0("sample", 1:ncol(df))
dim(df)
## [1] 273 44
ggplot(df, aes(x = sample1 +1)) + geom_histogram(bins = 50) + scale_y_log10() +
theme_bw() + xlab("counts (sample 1)") + ggtitle("Sample 1 distribution")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 41 rows containing missing values (geom_bar).
and with a log-transformation by
ggplot(df, aes(x = sample1 + 1)) + geom_histogram(bins = 50) +
theme_bw() + xlab("counts (sample 1)") + scale_x_log10() +
ggtitle("Sample 1 distribution (log scale)")
It is commun in metagenomic datasets to perform TSS (Total Sum Scaling) before further normalization. TSS transformation computes relative abundances: \[ y_{ij} = \frac{n_{ij}}{\sum_{k=1}^p n_{ik}} \] for \(n_{ij}\) the counts of species \(j\) in sample \(i\), \(p\) the number of species and \(n\) the number of individuals.
abundances_TSS <- apply(abundances, 1, function(asample)
asample / sum(asample))
df <- as.data.frame(t(abundances_TSS))
names(df) <- paste0("sample", 1:ncol(df))
dim(df)
## [1] 273 44
ggplot(df, aes(x = sample1 + 1)) + geom_histogram(bins = 50) +
theme_bw() + xlab("relative abundance (sample 1)") + scale_x_log10()
The next two histograms are based on the normalized counts with:
abundances_CLR <- logratio.transfo(abundances_TSS, logratio = "CLR",
offset = 1)
class(abundances_CLR) <- "matrix"
df <- data.frame(t(abundances_CLR)) # on transpose pour avoir une matrice 406*45 comme les autres
names(df) <- paste0("sample", 1:ncol(df))
dim(df)
## [1] 273 44
ggplot(df, aes(x = sample1)) + geom_histogram(bins = 50) + theme_bw() +
xlab("counts (sample 1)")
abundances_ILR <- logratio.transfo(abundances_TSS, logratio = "ILR",
offset = 1)
class(abundances_ILR) <- "matrix"
df <- data.frame(t(abundances_ILR)) # on transpose pour avoir une matrice 406*45 comme les autres
names(df) <- paste0("sample", 1:ncol(df))
dim(df)
## [1] 272 44
ggplot(df, aes(x = sample1)) + geom_histogram(bins = 50) + theme_bw() +
xlab("counts (sample 1)")
abundances_CSS <- newMRexperiment(abundances)
abundances_CSS <- cumNorm(abundances_CSS)
## Default value being used.
df <- data.frame(MRcounts(abundances_CSS))
names(df) <- paste0("sample", 1:ncol(df))
dim(df)
## [1] 273 44
ggplot(df, aes(x = sample1 + 1)) + geom_histogram(bins = 50) + theme_bw() +
xlab("counts (sample 1)") + scale_y_log10() + scale_x_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 16 rows containing missing values (geom_bar).
The less asymetric distribution seems to be the one obtained with the CLR transformation and the log-transformed CSS.
Distributions of all samples according to the type of transformation and the sample is provided below:
df_log <- log10(abundances + 1)
names(df_log) <- paste0("Sample", 1:ncol(df_log))
df_log <- melt(df_log)
## No id variables; using all as measure variables
df_CLR <- data.frame(t(abundances_CLR))
names(df_CLR) <- paste0("Sample", 1:ncol(df_CLR))
df_CLR <- melt(df_CLR)
## No id variables; using all as measure variables
df_ILR <- data.frame(t(abundances_ILR))
names(df_ILR) <- paste0("Sample", 1:ncol(df_ILR))
df_ILR <- melt(df_ILR)
## No id variables; using all as measure variables
df_CSS <- data.frame(log(MRcounts(abundances_CSS)) + 1)
names(df_CSS) <- paste0("Sample", 1:ncol(df_CSS))
df_CSS <- melt(df_CSS)
## No id variables; using all as measure variables
all_sizes <- c(nrow(df_log), nrow(df_CLR), nrow(df_ILR), nrow(df_CSS))
df <- data.frame(rbind(df_log, df_CLR, df_ILR, df_CSS),
"type" = rep(c("log", "CLR", "ILR", "log-CSS"), all_sizes))
#on transforme notre matrice abundances en vecteur, en gros pour le veau 1 - premier valeur, veau 1 - 2?me valeur et ainsi de suite
ggplot(df, aes(x = variable, y = value)) + geom_boxplot() + theme_bw() +
facet_wrap(~ type, scales = "free_y") + xlab("samples") +
theme(axis.text.x = element_blank())
## Warning: Removed 6287 rows containing non-finite values (stat_boxplot).
# echelle dif?rente pour chaque boxplot
A first exploratory analysis is performed with PCA on (merged) raw counts with log transformation:
pca_raw <- pca(log(t(abundances) + 1), ncomp = ncol(abundances),
logratio = 'none')
plot(pca_raw)
that shows a good percentage of explained variance for the first axis.
Projection of the individuals on the first two PCs also shows a good separation between the two conditions:
plotIndiv(pca_raw,
comp = c(1,2),
pch = 16,
ind.names = FALSE,
group = condition[-grep("29",id_abundances)],
col.per.group = color.mixo(1:2),
legend = TRUE,
title="PCA for log-tranformed data")
The same analysis is used with TSS normalized counts subsequently transformed by CLR or ILR (which is the expected analysis):
pca_CLR <- pca(abundances_TSS + 1, ncomp = nrow(abundances_TSS),
logratio = 'CLR')
plot(pca_CLR)
plotIndiv(pca_CLR,
comp = c(1,2),
pch = 16,
ind.names = FALSE,
group = condition[-grep("29",id_abundances)],
col.per.group = color.mixo(1:2),
legend = TRUE,
title="PCA for CLR-tranformed data")
pca_ILR <- pca(abundances_TSS + 1, ncomp = nrow(abundances_TSS) - 1,
logratio = 'ILR')
plot(pca_ILR)
plotIndiv(pca_ILR,
comp = c(1,2),
pch = 16,
ind.names = FALSE,
group = condition[-grep("29",id_abundances)],
col.per.group = color.mixo(1:2),
legend = TRUE,
title="PCA for ILR-tranformed data")
log_CSS <- log(MRcounts(abundances_CSS) + 1)
pca_CSS <- pca(t(log_CSS), ncomp = ncol(log_CSS))
plot(pca_CSS)
plotIndiv(pca_CSS,
comp = c(1,2),
pch = 16,
ind.names = FALSE,
group = condition[-grep("29",id_abundances)],
col.per.group = color.mixo(1:2),
legend = TRUE,
title="PCA for CSS-tranformed data")
A first PLS-DA is computed (with 10-fold CV) to check the efficiency of the method and which type of distance to use in its computation.
clean_log <- data.frame(log(t(abundances) + 1))
names(clean_log) <- list_bacteria
clean_condition <- factor(condition[id_abundances != "29"])
clean_id <- factor(id_abundances[id_abundances != "29"])
colnames(clean_log)<-paste0("var_",1:ncol(clean_log))
set.seed(11)
res_plsda <- plsda(clean_log, clean_condition, ncomp = nlevels(clean_condition))
res_perf <- perf(res_plsda, validation = 'Mfold', folds = 5,
progressBar = FALSE, nrepeat = 20)
plot(res_perf, overlay = 'measure', sd = TRUE)
plotIndiv(res_plsda , comp = c(1, 2), ind.names = FALSE, ellipse = TRUE,
legend = TRUE, title = "PLS-DA (log)")
PLS-DA shows a good separation between the two groups and indicates that the Mahalanobis distance provides the lower overall classification error.
Then, sparse PLS-DA is used (with the multilevel approach) to check which number of components to select.
set.seed(33)
res_plsda <- tune.splsda(clean_log, clean_condition,
ncomp = nlevels(clean_condition),
multilevel = clean_id,
test.keepX = c(seq(5, 200, 5)), validation = 'Mfold',
folds = 10, dist = 'mahalanobis.dist', nrepeat = 10,
progressBar = F)
## Splitting the variation for 1 level factor.
plot(res_plsda)
sel_keepX <- res_plsda$choice.keepX[1:2]
sel_keepX
## comp1 comp2
## 5 5
Finally sparse PLS-DA is performed and the variables explaining the two types of samples are obtained:
res_splsda <- splsda(clean_log, clean_condition,
ncomp = nlevels(clean_condition), multilevel = clean_id,
keepX = sel_keepX)
## Splitting the variation for 1 level factor.
plotIndiv(res_splsda, comp = c(1,2), ind.names = FALSE, ellipse = TRUE,
legend = TRUE, title = 'sPLS-DA (log)')
head(selectVar(res_splsda, comp = 1)$name)
## [1] "var_209" "var_6" "var_268" "var_260" "var_253"
plotLoadings(res_splsda, comp = 1, method = 'mean', contrib = 'max',
size.title = 1)
plotLoadings(res_splsda, comp = 2, method = 'mean', contrib = 'max',
size.title = 1)
indices<-as.numeric(gsub("var_","",selectVar(res_splsda, comp = 1)$name[1:5]))
rbind(selectVar(res_splsda, comp = 1)$name[1:5],list_bacteria[indices])
## [,1]
## [1,] "var_209"
## [2,] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovoculi M165/69"
## [,2]
## [1,] "var_6"
## [2,] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Corynebacteriaceae;Corynebacterium 1;Corynebacterium camporealensis"
## [,3]
## [1,] "var_268"
## [2,] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Moraxella;unknown species"
## [,4]
## [1,] "var_260"
## [2,] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Phyllobacteriaceae;Aquamicrobium;unknown species"
## [,5]
## [1,] "var_253"
## [2,] "Bacteria;Firmicutes;Clostridia;Clostridiales;Peptostreptococcaceae;Peptoclostridium;unknown species"
clean_CLR <- data.frame(abundances_CLR)
names(clean_CLR) <- paste0("var_",1:ncol(clean_CLR))
set.seed(11)
res_plsda <- plsda(clean_CLR, clean_condition, ncomp = nlevels(clean_condition))
res_perf <- perf(res_plsda, validation = 'Mfold', folds = 5,
progressBar = FALSE, nrepeat = 20)
plot(res_perf, overlay = 'measure', sd = TRUE)
plotIndiv(res_plsda , comp = c(1, 2), ind.names = FALSE, ellipse = TRUE,
legend = TRUE, title = 'PLS-DA (CLR)')
set.seed(33)
res_plsda <- tune.splsda(clean_CLR, clean_condition,
ncomp = nlevels(clean_condition),
multilevel = clean_id,
test.keepX = c(seq(5, 200, 5)), validation = 'Mfold',
folds = 10, dist = 'mahalanobis.dist', nrepeat = 10,
progressBar = F)
## Splitting the variation for 1 level factor.
plot(res_plsda)
sel_keepX <- res_plsda$choice.keepX[1:2]
sel_keepX
## comp1 comp2
## 185 35
res_splsda <- splsda(clean_CLR, clean_condition,
ncomp = nlevels(clean_condition), multilevel = clean_id,
keepX = sel_keepX)
## Splitting the variation for 1 level factor.
plotIndiv(res_splsda, comp = c(1,2), ind.names = FALSE, ellipse = TRUE,
legend = TRUE, title = 'sPLS-DA (CLR)')
head(selectVar(res_splsda, comp = 1)$value)
## value.var
## var_9 -0.2085567
## var_191 0.1751814
## var_202 0.1648123
## var_14 -0.1509830
## var_210 0.1495580
## var_252 -0.1487805
plotLoadings(res_splsda, comp = 1, method = 'mean', contrib = 'max')
plotLoadings(res_splsda, comp = 2, method = 'mean', contrib = 'max')
indices<-as.numeric(gsub("var_","",selectVar(res_splsda, comp = 1)$name[1:5]))
rbind(selectVar(res_splsda, comp = 1)$name[1:5],list_bacteria[indices])
## [,1]
## [1,] "var_9"
## [2,] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Corynebacteriaceae;Corynebacterium 1;Corynebacterium falsenii"
## [,2]
## [1,] "var_191"
## [2,] "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Providencia;unknown species"
## [,3]
## [1,] "var_202"
## [2,] "Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Stenotrophomonas;Stenotrophomonas maltophilia"
## [,4]
## [1,] "var_14"
## [2,] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Dietziaceae;Dietzia;unknown species"
## [,5]
## [1,] "var_210"
## [2,] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma dispar"
set.seed(11)
res_plsda <- plsda(clean_log, clean_condition, ncomp = nlevels(clean_condition))
res_perf <- perf(res_plsda, validation = 'Mfold', folds = 5,
progressBar = FALSE, nrepeat = 20)
plot(res_perf, overlay = 'measure', sd = TRUE)
plotIndiv(res_plsda , comp = c(1, 2), ind.names = FALSE, ellipse = TRUE,
legend = TRUE, title = 'PLS-DA (log)')
set.seed(33)
res_plsda <- tune.splsda(clean_log, clean_condition,
ncomp = nlevels(clean_condition),
test.keepX = c(seq(5, 200, 5)), validation = 'Mfold',
folds = 10, dist = 'mahalanobis.dist', nrepeat = 10,
progressBar = F)
plot(res_plsda)
sel_keepX <- res_plsda$choice.keepX[1:2]
sel_keepX
## comp1 comp2
## 15 75
res_splsda <- splsda(clean_log, clean_condition,
ncomp = nlevels(clean_condition), multilevel = clean_id,
keepX = sel_keepX)
## Splitting the variation for 1 level factor.
plotIndiv(res_splsda, comp = c(1,2), ind.names = FALSE, ellipse = TRUE,
legend = TRUE, title = 'sPLS-DA (log)')
head(selectVar(res_splsda, comp = 1)$value)
## value.var
## var_209 -0.5036882
## var_6 -0.4711617
## var_268 -0.3776953
## var_260 -0.3359919
## var_253 -0.3284574
## var_234 -0.2762594
plotLoadings(res_splsda, comp = 1, method = 'mean', contrib = 'max',
size.title = 1)
plotLoadings(res_splsda, comp = 2, method = 'mean', contrib = 'max',
size.title = 1)
indices<-as.numeric(gsub("var_","",selectVar(res_splsda, comp = 1)$name[1:5]))
rbind(selectVar(res_splsda, comp = 1)$name[1:5],list_bacteria[indices])
## [,1]
## [1,] "var_209"
## [2,] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovoculi M165/69"
## [,2]
## [1,] "var_6"
## [2,] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Corynebacteriaceae;Corynebacterium 1;Corynebacterium camporealensis"
## [,3]
## [1,] "var_268"
## [2,] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Moraxella;unknown species"
## [,4]
## [1,] "var_260"
## [2,] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Phyllobacteriaceae;Aquamicrobium;unknown species"
## [,5]
## [1,] "var_253"
## [2,] "Bacteria;Firmicutes;Clostridia;Clostridiales;Peptostreptococcaceae;Peptoclostridium;unknown species"
set.seed(11)
res_plsda <- plsda(clean_CLR, clean_condition, ncomp = nlevels(clean_condition))
res_perf <- perf(res_plsda, validation = 'Mfold', folds = 5,
progressBar = FALSE, nrepeat = 20)
plot(res_perf, overlay = 'measure', sd = TRUE)
plotIndiv(res_plsda , comp = c(1, 2), ind.names = FALSE, ellipse = TRUE,
legend = TRUE, title = 'PLS-DA (CLR)')
set.seed(33)
res_plsda <- tune.splsda(clean_CLR, clean_condition,
ncomp = nlevels(clean_condition),
test.keepX = c(seq(5, 200, 5)), validation = 'Mfold',
folds = 10, dist = 'mahalanobis.dist', nrepeat = 10,
progressBar = F)
plot(res_plsda)
sel_keepX <- res_plsda$choice.keepX[1:2]
sel_keepX
## comp1 comp2
## 5 75
res_splsda <- splsda(clean_CLR, clean_condition,
ncomp = nlevels(clean_condition), multilevel = clean_id,
keepX = c(5,25))
## Splitting the variation for 1 level factor.
plotIndiv(res_splsda, comp = c(1,2), ind.names = FALSE, ellipse = TRUE,
legend = TRUE, title = 'sPLS-DA (CLR)')
head(selectVar(res_splsda, comp = 1)$value)
## value.var
## var_9 -0.88787352
## var_191 0.39214002
## var_202 0.23812447
## var_14 -0.03271383
## var_210 0.01154814
plotLoadings(res_splsda, comp = 1, method = 'mean', contrib = 'max',
size.title = 1)
plotLoadings(res_splsda, comp = 2, method = 'mean', contrib = 'max',
size.title = 1)
indices<-as.numeric(gsub("var_","",selectVar(res_splsda, comp = 1)$name[1:5]))
rbind(selectVar(res_splsda, comp = 1)$name[1:5],list_bacteria[indices])
## [,1]
## [1,] "var_9"
## [2,] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Corynebacteriaceae;Corynebacterium 1;Corynebacterium falsenii"
## [,2]
## [1,] "var_191"
## [2,] "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Providencia;unknown species"
## [,3]
## [1,] "var_202"
## [2,] "Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Stenotrophomonas;Stenotrophomonas maltophilia"
## [,4]
## [1,] "var_14"
## [2,] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Dietziaceae;Dietzia;unknown species"
## [,5]
## [1,] "var_210"
## [2,] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma dispar"
library(randomForest)
Préparation des données :
df_pathogenes<-read.delim("../data/pathogenes.csv",sep = ",")
pathogenes<-df_pathogenes[,-c(1,9)]
pathogenes<-as.data.frame(ifelse(pathogenes=='p',1,0))
pathogenes<-as.data.frame(apply(t(pathogenes),1,as.factor))
id_pathogenes<-df_pathogenes[,1]
id_pathogenes <- gsub("-","0",id_pathogenes)
id_pathogenes <- gsub(" ","",id_pathogenes)
id_pathogenes <- sapply(as.character(id_pathogenes),FUN=function(x){substr(x,nchar(x)-1,nchar(x))})
id_pathogenes <- id_pathogenes[-grep("29",id_pathogenes)]
id_pathogenes <- paste0(id_pathogenes,"_",rev(condition[-grep("29",id_abundances)]))
Matcher les noms des veaux (puisque l’ordre de la liste de veaux dans le fichier pathogenes n’est pas le même que dans le fichier abundances) :
id_abundances_bis <- paste0(id_abundances[-grep("29",id_abundances)],"_",condition[-grep("29",id_abundances)])
Y<-pathogenes[match(id_abundances_bis,id_pathogenes),]
rownames(Y)<-id_abundances_bis
Y contient les variables reponse.
Le nom des batéries n’est pas accepté par RandomForest donc on les renomme. On transpose data pour avoir les variables en colonnes et les individus en lignes.
X <- as.data.frame(t(data))
X <- as.data.frame(sapply(X,as.integer))
colnames(X)<-paste0("var_",1:ncol(X)) # chgt nom des variables pour que RF marche
RSV :fit <- randomForest(Y$Ct.RSV ~ ., data = X,ntree=5000,mtry=40)
fit$confusion
## 0 1 class.error
## 0 37 0 0
## 1 7 0 1
Nothing in class 1…
# variables qui discriminent le mieux:
varImpPlot(fit)
# liste des 10 variables qui discriminent le mieux:
indices<-as.numeric(gsub("var_","",names((fit$importance[order(fit$importance[, 1], decreasing = TRUE), ])[1:10])))
list_bacteria[indices]
## [1] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Ureaplasma;unknown species"
## [2] "Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Stenotrophomonas;Stenotrophomonas sp."
## [3] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;unknown species"
## [4] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma dispar"
## [5] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Microbacteriaceae;Frigoribacterium;Frigoribacterium faeni"
## [6] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis"
## [7] "Bacteria;Actinobacteria;Actinobacteria;Propionibacteriales;Propionibacteriaceae;Propionibacterium;Propionibacterium sp."
## [8] "Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Stenotrophomonas;Stenotrophomonas maltophilia"
## [9] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa"
## [10] "Bacteria;Bacteroidetes;Sphingobacteriia;Sphingobacteriales;Chitinophagaceae;unknown genus;cilia-associated respiratory bacterium 246-57"
# graphique montrant comment reduit l'OOB en fonction du nombre d'arbres generes
plot(fit$err.rate[, 1], type = "l", xlab = "nombre d'arbres", ylab = "erreur OOB")
Prédictions obtenues :
cat(fit$predicted)
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
PI.3 :fit <- randomForest(Y$Ct.PI.3 ~ ., data = X,ntree=1000,mtry=10)
fit$confusion
## 0 1 class.error
## 0 41 0 0
## 1 3 0 1
Nothing in class 1…
# variables qui discriminent le mieux:
varImpPlot(fit)
# liste des 10 variables qui discriminent le mieux:
indices<-as.numeric(gsub("var_","",names((fit$importance[order(fit$importance[, 1], decreasing = TRUE), ])[1:10])))
list_bacteria[indices]
## [1] "&"
## [2] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovirhinis"
## [3] "Bacteria;Bacteroidetes;Sphingobacteriia;Sphingobacteriales;Sphingobacteriaceae;Sphingobacterium;unknown species"
## [4] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus sp."
## [5] "Bacteria;Firmicutes;Negativicutes;Selenomonadales;Veillonellaceae;Megasphaera;Megasphaera elsdenii"
## [6] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Moraxella;Moraxella bovis"
## [7] "Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Bergeyella;unknown species"
## [8] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Moraxella;unknown species"
## [9] "Bacteria;Firmicutes;Clostridia;Clostridiales;Family XIII;unknown genus;unknown species"
## [10] "Bacteria;Firmicutes;Erysipelotrichia;Erysipelotrichales;Erysipelotrichaceae;Faecalicoccus;Faecalicoccus pleomorphus"
# graphique montrant comment reduit l'OOB en fonction du nombre d'arbres generes
plot(fit$err.rate[, 1], type = "l", xlab = "nombre d'arbres", ylab = "erreur OOB")
P.multocida :fit <- randomForest(Y$Ct.P.multocida ~ ., data = X,ntree=1000,mtry=10)
fit$confusion
## 0 1 class.error
## 0 0 7 1
## 1 0 37 0
Nothing in class 0
# variables qui discriminent le mieux:
varImpPlot(fit)
# liste des 10 variables qui discriminent le mieux:
indices<-as.numeric(gsub("var_","",names((fit$importance[order(fit$importance[, 1], decreasing = TRUE), ])[1:10])))
list_bacteria[indices]
## [1] "Bacteria;Fusobacteria;Fusobacteriia;Fusobacteriales;Leptotrichiaceae;unknown genus;unknown species"
## [2] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Micrococcaceae;Arthrobacter;Arthrobacter sp."
## [3] "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Delftia;Delftia acidovorans"
## [4] "Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Leptospiraceae;Leptospira;Leptospira broomii"
## [5] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma dispar"
## [6] "Bacteria;Bacteroidetes;Sphingobacteriia;Sphingobacteriales;Chitinophagaceae;unknown genus;cilia-associated respiratory bacterium 246-57"
## [7] "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Delftia;Delftia sp."
## [8] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Carnobacteriaceae;Desemzia;Desemzia incerta"
## [9] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovirhinis"
## [10] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Micrococcaceae;Arthrobacter;Arthrobacter arilaitensis"
# graphique montrant comment reduit l'OOB en fonction du nombre d'arbres generes
plot(fit$err.rate[, 1], type = "l", xlab = "nombre d'arbres", ylab = "erreur OOB")
M.haemolytica :fit <- randomForest(Y$Ct.M.haemolytica ~ ., data = X,ntree=1000,mtry=40)
fit$confusion
## 0 1 class.error
## 0 32 1 0.03030303
## 1 10 1 0.90909091
# variables qui discriminent le mieux:
varImpPlot(fit)
# liste des 10 variables qui discriminent le mieux:
indices<-as.numeric(gsub("var_","",names((fit$importance[order(fit$importance[, 1], decreasing = TRUE), ])[1:10])))
list_bacteria[indices]
## [1] "Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Stenotrophomonas;Stenotrophomonas maltophilia"
## [2] "Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Stenotrophomonas;Stenotrophomonas sp."
## [3] "Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Gelidibacter;unknown species"
## [4] "Bacteria;Firmicutes;Clostridia;Clostridiales;Family XII;Guggenheimella;unknown species"
## [5] "Bacteria;Proteobacteria;Alphaproteobacteria;Caulobacterales;Caulobacteraceae;Brevundimonas;Brevundimonas sp."
## [6] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella 7;unknown species"
## [7] "Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Saccharopolyspora;Saccharopolyspora rectivirgula"
## [8] "Bacteria;Actinobacteria;Actinobacteria;Actinopolysporales;Actinopolysporaceae;Actinopolyspora;Saccharopolyspora rectivirgula"
## [9] "Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Saccharomonospora;Saccharomonospora glauca K62"
## [10] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa"
# graphique montrant comment reduit l'OOB en fonction du nombre d'arbres generes
plot(fit$err.rate[, 1], type = "l", xlab = "nombre d'arbres", ylab = "erreur OOB")
M.bovis :fit <- randomForest(Y$Ct.M.bovis ~ ., data = X,ntree=500,mtry=40)
fit$confusion
## 0 1 class.error
## 0 40 0 0
## 1 4 0 1
# variables qui discriminent le mieux:
varImpPlot(fit)
# liste des 10 variables qui discriminent le mieux:
indices<-as.numeric(gsub("var_","",names((fit$importance[order(fit$importance[, 1], decreasing = TRUE), ])[1:10])))
list_bacteria[indices]
## [1] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovis"
## [2] "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Delftia;Delftia acidovorans"
## [3] "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Alcaligenaceae;Achromobacter;Achromobacter sp."
## [4] "Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Jeotgalicoccus;unknown species"
## [5] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Brucellaceae;Ochrobactrum;Bacillus sp."
## [6] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma alkalescens 14918"
## [7] "Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Saccharopolyspora;Saccharopolyspora rectivirgula"
## [8] "Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus succinus"
## [9] "Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Jeotgalicoccus;Jeotgalicoccus psychrophilus"
## [10] "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Alcaligenaceae;Alcaligenes;Alcaligenes sp."
# graphique montrant comment reduit l'OOB en fonction du nombre d'arbres generes
plot(fit$err.rate[, 1], type = "l", xlab = "nombre d'arbres", ylab = "erreur OOB")
H.somni :fit <- randomForest(Y$Ct.H.somni ~ ., data = X,ntree=1000,mtry=16)
fit$confusion
## 0 1 class.error
## 0 21 6 0.2222222
## 1 17 0 1.0000000
# variables qui discriminent le mieux:
varImpPlot(fit)
# liste des 10 variables qui discriminent le mieux:
indices<-as.numeric(gsub("var_","",names((fit$importance[order(fit$importance[, 1], decreasing = TRUE), ])[1:10])))
list_bacteria[indices]
## [1] "Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Sporosarcina;Sporosarcina sp."
## [2] "Bacteria;Proteobacteria;Gammaproteobacteria;Pasteurellales;Pasteurellaceae;Histophilus;Histophilus somni"
## [3] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma dispar"
## [4] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovirhinis"
## [5] "Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Leptospiraceae;Leptospira;Leptospira broomii"
## [6] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Rhizobium;Neorhizobium huautlense"
## [7] "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Delftia;Delftia acidovorans"
## [8] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Moraxella;Moraxella bovoculi 237"
## [9] "Bacteria;Actinobacteria;Actinobacteria;Actinopolysporales;Actinopolysporaceae;Actinopolyspora;Saccharopolyspora rectivirgula"
## [10] "Bacteria;Fusobacteria;Fusobacteriia;Fusobacteriales;Leptotrichiaceae;unknown genus;unknown species"
# graphique montrant comment reduit l'OOB en fonction du nombre d'arbres generes
plot(fit$err.rate[, 1], type = "l", xlab = "nombre d'arbres", ylab = "erreur OOB")
## Random Forest sur la condition EN
On sépare les groupes
X_EN<-X[grep("EN",colnames(abundances)),]
Y_EN<-Y[grep("EN",colnames(abundances)),]
cat(colnames(Y_EN))
## Ct.RSV Ct.PI.3 Ct.Coronavirus Ct.P.multocida Ct.M.haemolytica Ct.M.bovis Ct.H.somni
fit <- randomForest(Y_EN$Ct.H.somni ~ ., data = X_EN,ntree=5000,mtry=40)
fit$confusion
## 0 1 class.error
## 0 12 2 0.1428571
## 1 8 0 1.0000000
Nothing in class 1…
# variables qui discriminent le mieux:
varImpPlot(fit)
# liste des 10 variables qui discriminent le mieux:
indices<-as.numeric(gsub("var_","",names((fit$importance[order(fit$importance[, 1], decreasing = TRUE), ])[1:10])))
list_bacteria[indices]
## [1] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Moraxella;unknown species"
## [2] "Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Sporosarcina;Sporosarcina sp."
## [3] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Rhizobium;Neorhizobium huautlense"
## [4] "Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Gelidibacter;unknown species"
## [5] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Microbacteriaceae;Plantibacter;Plantibacter agrosticola"
## [6] "Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Actinomycetaceae;Trueperella;Trueperella pyogenes"
## [7] "Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Actinomycetaceae;Flaviflexus;unknown species"
## [8] "Bacteria;Actinobacteria;Coriobacteriia;Coriobacteriales;Coriobacteriaceae;Atopobium;Olsenella umbonata"
## [9] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Phyllobacteriaceae;unknown genus;unknown species"
## [10] "Bacteria;Proteobacteria;Gammaproteobacteria;Pasteurellales;Pasteurellaceae;Pasteurella;Pasteurella multocida"
# graphique montrant comment reduit l'OOB en fonction du nombre d'arbres generes
plot(fit$err.rate[, 1], type = "l", xlab = "nombre d'arbres", ylab = "erreur OOB")
Prédictions obtenues :
cat(fit$predicted)
## 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] randomForest_4.6-12 reshape2_1.4 metagenomeSeq_1.18.0
## [4] RColorBrewer_1.1-2 glmnet_2.0-13 foreach_1.4.3
## [7] Matrix_1.2-11 limma_3.32.10 Biobase_2.36.2
## [10] BiocGenerics_0.22.1 mixOmics_6.3.0 ggplot2_2.2.1
## [13] lattice_0.20-35 MASS_7.3-47
##
## loaded via a namespace (and not attached):
## [1] gtools_3.5.0 purrr_0.2.4 colorspace_1.2-4
## [4] htmltools_0.3.6 yaml_2.1.14 rlang_0.1.4
## [7] glue_1.2.0 bindrcpp_0.2 matrixStats_0.52.2
## [10] plyr_1.8.3 bindr_0.1 stringr_1.2.0
## [13] munsell_0.4.2 gtable_0.1.2 caTools_1.17.1
## [16] htmlwidgets_0.9 codetools_0.2-15 evaluate_0.10.1
## [19] labeling_0.3 knitr_1.17 httpuv_1.3.5
## [22] rARPACK_0.11-0 Rcpp_0.12.13 KernSmooth_2.23-15
## [25] xtable_1.8-2 corpcor_1.6.9 scales_0.5.0
## [28] backports_1.1.1 gdata_2.18.0 jsonlite_1.5
## [31] mime_0.5 RSpectra_0.12-0 gplots_3.0.1
## [34] gridExtra_2.3 ellipse_0.3-8 digest_0.6.12
## [37] stringi_1.1.5 dplyr_0.7.4 shiny_1.0.5
## [40] grid_3.4.1 rprojroot_1.2 bitops_1.0-6
## [43] tools_3.4.1 magrittr_1.5 rgl_0.98.1
## [46] lazyeval_0.2.1 tibble_1.3.4 tidyr_0.7.2
## [49] pkgconfig_2.0.1 assertthat_0.2.0 rmarkdown_1.7
## [52] iterators_1.0.8 R6_2.2.2 igraph_1.1.2
## [55] compiler_3.4.1